---
id: doc-summarization-catching-llm-regressions
title: Catching Regressions from Prompt Template Changes
sidebar_label: Catching Model Regressions
---

In the previous section, we tried to improve our legal document summarizer's prompt template. The test case that originally failed now passes both the `completeness` and `concision` metrics, but a different test case now fails the `concision` metric.

In this section, we'll **compare the results of our two evaluation test runs** and identify further improvements to our summarizer's hyperparameters.

:::tip
Confident AI allows you to _compare test runs side by side_, helping identify improvements and regressions so you can make the necessary adjustments to your hyperparameters.  
:::

## Comparing Evaluation Test Runs

To compare evaluation test runs, go to the **compare test runs** page of the latest test run (the test run using our improved prompt template) and select the test run ID of the first evaluation.

:::info  
A test case is highlighted in red if any metric scores **regress** and in green if there are improvements. Clicking on a metric row lets you compare the metric score and reasons side by side.  
:::

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
    marginBottom: "20px",
  }}
>
  <video width="100%" autoPlay loop muted playsInline>
    <source
      src="https://deepeval-docs.s3.us-east-1.amazonaws.com/tutorial-legal-document-summarizer-compare-test-runs.mp4"
      type="video/mp4"
    />
  </video>
</div>
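The same comparison logic can be sketched in a few lines of Python. This is only an illustration of the idea, not how Confident AI implements it: the `MetricResult` class, the `diff_runs` helper, and the scores below are all hypothetical, and the sketch assumes that any drop in score counts as a regression.

```python
from dataclasses import dataclass


# Hypothetical container for one metric result (not a DeepEval or Confident AI type).
@dataclass(frozen=True)
class MetricResult:
    test_case: str
    metric: str
    score: float


def diff_runs(old, new):
    """Label every (test case, metric) pair that appears in both runs."""
    old_by_key = {(r.test_case, r.metric): r for r in old}
    changes = {}
    for result in new:
        key = (result.test_case, result.metric)
        previous = old_by_key.get(key)
        if previous is None:
            continue  # no baseline to compare against
        if result.score < previous.score:
            changes[key] = "regressed"
        elif result.score > previous.score:
            changes[key] = "improved"
        else:
            changes[key] = "unchanged"
    return changes


# Made-up scores, chosen only to mirror the situation described in this tutorial.
old_run = [
    MetricResult("test_case_1", "completeness", 0.4),
    MetricResult("test_case_3", "concision", 0.9),
]
new_run = [
    MetricResult("test_case_1", "completeness", 0.9),
    MetricResult("test_case_3", "concision", 0.3),
]

changes = diff_runs(old_run, new_run)
for (case, metric), label in changes.items():
    print(f"{case} / {metric}: {label}")
```

Keying on the test case and metric together means an improvement in one metric never hides a regression in another metric of the same test case.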

### Inspecting Concision Regression

As hypothesized, our first test case is now passing because the summary is complete. However, the third test case's concision regression is more interesting. Comparing the metric reasons, the previous test run stated: _"The actual output succinctly summarizes all essential elements from the original data."_ The new test run says: _"The output is overly repetitive, stating the same key information multiple times."_ The summary is not only less concise but also repetitive.

<details><summary>View metric reason changes</summary>
<p>

```python
new_reason = "The output is overly repetitive, stating the same key information multiple times, such as service details, penalty, confidentiality, liability, dispute resolution, and agreement terms. However, it includes essential information about parties, services, financials, confidentiality, liability, dispute resolution, and governance, promoting clarity and completeness."

old_reason = "The actual output succinctly summarizes all essential elements from the original data, including contract parties, services provided, response times, contract duration, payment terms, confidentiality, liability, dispute resolution, and governing law, without including any redundant information."
```

</p>
</details>

Let's try changing the model powering our legal document summarizer to see if the summaries become more robust.

```python
model = "gpt-4o"
```

### Running Evaluations One Final Time

Changing the model resulted in all the test cases passing! However, it's important to acknowledge that our `EvaluationDataset` consists of only five documents, **an extremely small sample**. As the number of documents grows, so does the _potential for errors_, so five passing test cases are not strong evidence of robustness. Only consistent passes across a much larger dataset would demonstrate that.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
    marginBottom: "20px",
  }}
>
  <video width="100%" autoPlay loop muted playsInline>
    <source
      src="https://deepeval-docs.s3.us-east-1.amazonaws.com/tutorial-legal-document-summarizer-final-test-run.mp4"
      type="video/mp4"
    />
  </video>
</div>
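To get a feel for how little five passing documents tell us, here is a rough back-of-the-envelope sketch. It assumes each document is an independent draw from the same kind of input, which real datasets only approximate. The helper name `max_failure_rate` is made up for this illustration and is not part of DeepEval.

```python
def max_failure_rate(n_passed: int, confidence: float = 0.95) -> float:
    """Upper bound on the true failure rate after n_passed passes and zero failures.

    If the true failure rate is p, the chance of n straight passes is (1 - p) ** n.
    Solving (1 - p) ** n = 1 - confidence gives the largest p that is still
    plausible at the chosen confidence level.
    """
    if n_passed <= 0:
        raise ValueError("n_passed must be a positive integer")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_passed)


for n in (5, 50, 500):
    bound = max_failure_rate(n)
    print(f"{n:>3} passing documents: failure rate could still be up to {bound:.1%}")
```

Under these assumptions, five clean passes are still consistent with the summarizer failing on roughly 45% of documents, and the bound only tightens as the dataset grows.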

In the next and final section, we'll learn how to easily create and maintain a dataset and retrieve it for evaluations with just two lines of code. This will allow you to systematically evaluate your LLM summarizer at scale. Let's begin!
